This is a continuation of the initial study of county subdivisions in the MPO. Given the relative lack of variation at the county subdivision level, this extension applies the earlier methodology to census tracts to take advantage of their higher variation. Smaller geographies, taken as a group, tend to vary more because there is less aggregation masking the variation. The Boston region has been residentially segregated by race, historically and in the present, and that segregation shows more accurately in smaller geographies. For example, in the majority-minority City of Boston, much of the Black population lives in the neighborhoods of Roxbury, Dorchester, and Mattapan, while many other neighborhoods, and most of Boston's suburbs, remain overwhelmingly white. Looking at different levels of geography allows exploration of patterns in demographic data that might not appear when aggregated to the county subdivision level.
https://www.bostonmagazine.com/news/2020/12/08/boston-segregation/
While the purpose of this extension was to see whether the greater variation in the demographic data at the tract level would produce clustering, it does not appear to. The overall pattern is a single large cluster spanning a sizeable range of data spread along a gradient rather than separated into groups, plus some outlying features that DBSCAN categorizes as noise.
Many other unsupervised machine learning algorithms could be applied to this data to explore other results, but after exploring the data through this study, I would not suggest it. In general, we would apply other algorithms to capture clusters whose features require a different way of conceptualizing them than the algorithms already used. Even though data in nine dimensions is hard to visualize, the visualizations of the results suggest there are not multiple clusters to find, no matter the algorithm.
Note that this does not preclude unsupervised machine learning from being a useful tool for exploring patterns in demographic data, just not at this scale and with these particular variables. While useful clusters were not found, the pattern that was uncovered is a gradient of values across all the variables used. This confirms something we have known for a while: using thresholds to define 'Equity' areas cannot capture the true patterns in the data.
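The threshold problem can be sketched with synthetic data (illustration only, not the MPO variables): when a variable varies along a smooth gradient, any cutoff separates tracts that are statistically indistinguishable from their neighbors across the line.

```python
import numpy as np

# Synthetic "tracts" whose minority share rises smoothly from 0 to 1
rng = np.random.default_rng(0)
share = np.sort(rng.uniform(0.0, 1.0, 500))

# A hard 'Equity area' threshold at 0.5 (hypothetical cutoff)
flagged = share >= 0.5

# The gap between the last unflagged tract and the first flagged one is
# tiny: the cutoff splits near-identical tracts into different categories.
boundary_gap = share[flagged].min() - share[~flagged].max()
print(f"gap across the threshold: {boundary_gap:.4f}")
```

On a true gradient the boundary gap shrinks toward zero as the number of tracts grows, which is exactly why a threshold-based definition is arbitrary here.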
| Demographic | Tables | Fields | Calculation | Notes |
|---|---|---|---|---|
| Race/Ethnicity | ACS14: B03002 | B03002_001, B03002_003 | Total - White Alone | |
| Limited English Proficiency | ACS14: B16001 | B16001_005 + B16001_008 + B16001_011 + B16001_014 + B16001_017 + B16001_020 + B16001_023 + B16001_026 + B16001_029 + B16001_032 + B16001_035 + B16001_038 + B16001_041 + B16001_044 + B16001_047 + B16001_050 + B16001_053 + B16001_056 + B16001_059 + B16001_062 + B16001_065 + B16001_068 + B16001_071 + B16001_074 + B16001_077 + B16001_080 + B16001_083 + B16001_086 + B16001_089 + B16001_092 + B16001_095 + B16001_098 + B16001_101 + B16001_104 + B16001_107 + B16001_110 + B16001_113 + B16001_116 + B16001_119 | C16001 (less than very well) / (Total Population - (B01001_003 + B01001_027)) | |
| Median Income | ACS14: B19013 | B19013_001 | | |
| People with income below 200% of the poverty line | ACS14: C17002 | C17002_002E + C17002_003E + C17002_004E + C17002_005E + C17002_006E + C17002_007E | | |
| Low Income Households | ACS14: B19001, B19025, B11001 | All of B19001, B19025_001, B11001_001 | | HH Income Ranges, Aggregate HH Income, Total HH |
| No Car Households | ACS14: B08201 | B08201_002 | | HH with no vehicles available |
| Population Density | https://jtleider.github.io/censusdata/api.html | B01001_001, AREA | Total Pop / AREA | Area comes from the tract shapefile; also uses total population data |
| Children | ACS14: B01001, 2010Cen: P12 | (B01001_003 + B01001_004 + B01001_005 + B01001_006 + B01001_027 + B01001_028 + B01001_029 + B01001_030), (P012_003 + P012_004 + P012_005 + P012_006 + P012_027 + P012_028 + P012_029 + P012_030) | Boys under 18 plus girls under 18 | Ages 0-17 |
| Population Over 5 | ACS14: B01001, 2010Cen: P12 | B01001_001 - (B01001_003 + B01001_027), P012_001 - (P012_003 + P012_027) | Total Population - Children under 5 | Ages 5+ |
| Seniors | ACS14: B01001 | (B01001_020 + B01001_021 + B01001_022 + B01001_023 + B01001_024 + B01001_025) + (B01001_044 + B01001_045 + B01001_046 + B01001_047 + B01001_048 + B01001_049) | Men ages 65+ plus women ages 65+ | Ages 65+ |
| People with Disabilities | ACS14: S1810 | S1810_C01_001E, S1810_C02_001E | S1810_C02_001E / S1810_C01_001E (total pop with a disability / total non-institutionalized population) | Includes: Ambulatory, Hearing, Vision, Self-Care, Cognitive, Independent Living Difficulties |
| Total Population | ACS14: B01001, 2010Cen: P1 | B01001_001, P01_001 | | Includes those housed in group quarters |
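As a concrete reading of the table above, the derived counts and shares can be computed from the raw fields like this (toy numbers; `LEP#` stands in for the long B16001 sum):

```python
import pandas as pd

# Toy tract-level counts following the table's field definitions
df = pd.DataFrame({
    "B03002_001E": [4000, 2500],   # total population
    "B03002_003E": [1000, 2000],   # white alone, non-Hispanic
    "B01001_001E": [4000, 2500],   # total population (B01001)
    "B01001_003E": [150, 80],      # boys under 5
    "B01001_027E": [140, 70],      # girls under 5
    "LEP#":        [900, 120],     # sum of the 'less than very well' fields
})

df["Minority#"] = df["B03002_001E"] - df["B03002_003E"]  # Total - White Alone
df["Over5#"] = df["B01001_001E"] - (df["B01001_003E"] + df["B01001_027E"])
df["Minority%"] = df["Minority#"] / df["B01001_001E"]
df["LEP%"] = df["LEP#"] / df["Over5#"]  # universe = population age 5 and over
```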
import os
os.environ["OMP_NUM_THREADS"] = "1" #disable parallel processing because of a KMeans memory leak on Windows
import numpy as np
import pandas as pd
import geopandas as gpd
import censusdata
import matplotlib
from matplotlib import pyplot
import matplotlib.cm as cm
import sklearn
from functools import reduce
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import plotly.express as px
#set some parameters to make display of data nicer
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
mpoTowns = ["Beverly","Boston","Braintree","Cambridge","Chelsea","Everett","Framingham","Franklin","Gloucester",
"Lynn","Malden","Marlborough","Medford","Melrose","Newton","Peabody","Quincy","Revere","Salem","Somerville",
"Waltham","Watertown","Weymouth","Woburn","Acton","Arlington","Ashland","Bedford","Bellingham","Belmont",
"Bolton","Boxborough","Brookline","Burlington","Canton","Carlisle","Cohasset","Concord","Danvers","Dedham",
"Dover","Essex","Foxborough","Hamilton","Hingham","Holbrook","Holliston","Hopkinton","Hudson","Hull","Ipswich",
"Lexington","Lincoln","Littleton","Lynnfield","Manchester-by-the-Sea","Marblehead","Marshfield","Maynard",
"Medfield", "Medway","Middleton","Milford","Millis","Milton","Nahant","Natick","Needham","Norfolk",
"North Reading","Norwell","Norwood","Randolph","Reading","Rockland","Rockport","Saugus","Scituate",
"Sharon","Sherborn","Southborough","Stoneham","Stow","Sudbury","Swampscott","Topsfield","Wakefield","Walpole",
"Wayland","Wellesley","Wenham","Weston","Westwood","Wilmington","Winchester","Winthrop","Wrentham"]
#county subdivision codes from the initial study; superseded by the tract list below
mpoNums = ['05595','16250','21850','26150','27900','32310','37490','37560','37995','38400','41095','43580','52490','55955',
'57880','59105','60015','68645','70150','74595','00380','01605','02130','04615','05070','07350','09840','11000',
'11525','15060','21990','24960','24925','30700','31085','31540','35215','35425','35950','37875','38715','39625',
'39835','40115','43895','45560','48955','56130','61380','62535','67665','68050','68260','72215','72600','73440','73790',
'77255','80230','80510','81035','04930','07740','09175','11315','14640','16495','17405','24820','25172','30455',
'39765','39975','41515','41690','44105','46050','50250','55745','56000','60785','72495','74175','78690','78972',
'82315','30210','31645','38855','50145','57775','60330','07000','13205','56585','81005','06365','41165','63165']
#tract codes for the MPO, read from file (this supersedes the list above)
mpoNums = pd.read_csv(r'C:\Users\AtkinsonM\OneDrive - Commonwealth of Massachusetts\Documents\Jupyter_Home\MPO_Tracts_2012.csv',
                      usecols=['tractce'], dtype = {'tractce': str})['tractce'].tolist()
https://www.census.gov/prod/cen2010/doc/sf1.pdf
https://www.census.gov/programs-surveys/acs/technical-documentation/table-shells.2014.html
#create lists of fields needed for each calculation
#ACS 2014 columns
race = ['B03002_001E','B03002_003E'] #universe = total population
lep = ['B16001_005E','B16001_008E','B16001_011E','B16001_014E','B16001_017E','B16001_020E','B16001_023E','B16001_026E',
'B16001_029E','B16001_032E','B16001_035E','B16001_038E','B16001_041E','B16001_044E','B16001_047E','B16001_050E',
'B16001_053E','B16001_056E','B16001_059E','B16001_062E','B16001_065E','B16001_068E','B16001_071E','B16001_074E',
'B16001_077E','B16001_080E','B16001_083E','B16001_086E','B16001_089E','B16001_092E','B16001_095E','B16001_098E',
'B16001_101E','B16001_104E','B16001_107E','B16001_110E','B16001_113E','B16001_116E','B16001_119E'] #universe = population > age 5
medinc = ['B19013_001E'] #universe = households; a median, so there is nothing to sum
povPeop = ['C17002_002E','C17002_003E','C17002_004E','C17002_005E','C17002_006E','C17002_007E']
nocar = ['B08201_002E'] #Universe = Households
u18 = ['B01001_003E','B01001_004E','B01001_005E','B01001_006E','B01001_027E','B01001_028E',
'B01001_029E','B01001_030E'] #universe = total population
sen = ['B01001_020E','B01001_021E','B01001_022E','B01001_023E','B01001_024E','B01001_025E',
'B01001_044E','B01001_045E','B01001_046E','B01001_047E','B01001_048E','B01001_049E']#universe = total population
dis = ['S1810_C02_001E','S1810_C01_001E'] #universe = total NON institutionalized population
acsHH = ['B11001_001E']
acsPlus5 = ['B01001_001E','B01001_003E', 'B01001_027E']
acsTotPop = ['B01001_001E']
#decennial census columns
totpop = ['P001_001E']
plus5 = ['P012_0001E','P012_0003E','P012_0027E']
hh = ['P018_0001E']
#################################GRAB ACS DATA###################################################
#MINORITY
#get table and reset index so that county sub is a field called muni
minority = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
race).reset_index().rename(columns={'index':'muni'})
#create the actual minority field
minority['Minority#'] = minority['B03002_001E'] - minority['B03002_003E']
#filter down to just what is in the MPO
minority['tract'] = minority['muni'].astype(str).str[-6:]
minority=minority[minority['tract'].isin(mpoNums)]
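The download → derive → filter steps above repeat for every variable below; the filter step, at least, could be factored into a small helper (hypothetical name, demonstrated on a toy frame since the real input comes from `censusdata.download`):

```python
import pandas as pd

def filter_to_mpo(df, mpo_tracts, geo_col="muni"):
    """Parse the 6-digit tract code from the end of the census geography
    string and keep only rows whose tract is in the MPO list."""
    out = df.copy()
    out["tract"] = out[geo_col].astype(str).str[-6:]
    return out[out["tract"].isin(mpo_tracts)]

# Toy illustration: made-up geography strings ending in the tract code
toy = pd.DataFrame({"muni": ["state:25> county:017> tract:744101",
                             "state:25> county:017> tract:999999"],
                    "B03002_001E": [4000, 2500]})
kept = filter_to_mpo(toy, ["744101"])
```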
#LEP
limEngProf = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
lep).reset_index().rename(columns={'index':'muni'})
#create the actual LEP field
limEngProf['LEP#'] = limEngProf[lep].sum(axis=1) #sum the named fields (slicing columns[:-1] would include 'muni' and drop the last field)
#filter down to just what is in the MPO
limEngProf['tract'] = limEngProf['muni'].astype(str).str[-6:]
limEngProf=limEngProf[limEngProf['tract'].isin(mpoNums)]
#MEDIAN INCOME
medIncome = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
medinc).reset_index().rename(columns={'index':'muni', 'B19013_001E':'MedianIncome'})
#filter down to just what is in the MPO
medIncome['tract'] = medIncome['muni'].astype(str).str[-6:]
medIncome=medIncome[medIncome['tract'].isin(mpoNums)]
#PEOPLE LIVING IN HOUSEHOLDS BELOW 200% OF THE POVERTY LINE
povHH = censusdata.download('acs5',2014,censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ),
('tract', '*')]),
povPeop).reset_index().rename(columns={'index':'muni'})
#create the actual HH Pov field
povHH['POV#'] = povHH[povPeop].sum(axis=1) #sum the named fields
#filter down to just what is in the MPO
povHH['tract'] = povHH['muni'].astype(str).str[-6:]
povHH=povHH[povHH['tract'].isin(mpoNums)]
#NO CAR HOUSEHOLDS
noCarHH = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
nocar).reset_index().rename(columns={'index':'muni', 'B08201_002E':'NoCarHH#'})
#filter down to just what is in the MPO
noCarHH['tract'] = noCarHH['muni'].astype(str).str[-6:]
noCarHH=noCarHH[noCarHH['tract'].isin(mpoNums)]
#CHILDREN
kids = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
u18).reset_index().rename(columns={'index':'muni'})
#create the actual kids field
kids['Under18#'] = kids[u18].sum(axis=1) #sum the named fields
#filter down to just what is in the MPO
kids['tract'] = kids['muni'].astype(str).str[-6:]
kids=kids[kids['tract'].isin(mpoNums)]
#SENIORS
seniors = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
sen).reset_index().rename(columns={'index':'muni'})
#create the actual seniors field
seniors['Over65#'] = seniors[sen].sum(axis=1) #sum the named fields
#filter down to just what is in the MPO
seniors['tract'] = seniors['muni'].astype(str).str[-6:]
seniors=seniors[seniors['tract'].isin(mpoNums)]
#DISABILITY
disability = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
dis,'e4bec76221ba04c7df76c7c580659bf1f54ed2c1',
'subject').reset_index().rename(columns={'index':'muni','S1810_C02_001E':'Disability#'})
#the universe lives in the same table, so compute the % here
disability['Disability%'] = disability['Disability#']/disability['S1810_C01_001E']
#filter down to just what is in the MPO
disability['tract'] = disability['muni'].astype(str).str[-6:]
disability=disability[disability['tract'].isin(mpoNums)]
#HOUSEHOLDS
households = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
acsHH).reset_index().rename(columns={'index':'muni', 'B11001_001E':'Households#'})
#filter down to just what is in the MPO
households['tract'] = households['muni'].astype(str).str[-6:]
households=households[households['tract'].isin(mpoNums)]
#POPULATION OVER FIVE
over5 = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
acsPlus5).reset_index().rename(columns={'index':'muni'})
#create the actual Over 5 field
over5['Over5#'] = (over5['B01001_001E'] - (over5['B01001_003E'] + over5['B01001_027E']))
#filter down to just what is in the MPO
over5['tract'] = over5['muni'].astype(str).str[-6:]
over5=over5[over5['tract'].isin(mpoNums)]
#TOTAL POPULATION
acsPop = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('tract', '*')]),
acsTotPop).reset_index().rename(columns={'index':'muni', 'B01001_001E':'TotalPopulation'})
#filter down to just what is in the MPO
acsPop['tract'] = acsPop['muni'].astype(str).str[-6:]
acsPop=acsPop[acsPop['tract'].isin(mpoNums)]
#Merge ACS data into one dataframe and make %
#collect all the data (and filter)
acsdata = [minority[['muni', 'Minority#']], limEngProf[['muni', 'LEP#']], medIncome[['muni', 'MedianIncome']],
povHH[['muni','POV#']], noCarHH[['muni', 'NoCarHH#']], kids[['muni', 'Under18#']], seniors[['muni', 'Over65#']],
disability[['muni', 'Disability#', 'Disability%']], households[['muni','Households#']], over5[['muni','Over5#']],
acsPop[['muni','TotalPopulation']]]
#actually do the merging
acsDF = reduce(lambda left,right: pd.merge(left,right,on=['muni'], how='outer'), acsdata)
#make the % fields
acsDF['Minority%'] = acsDF['Minority#']/acsDF['TotalPopulation']
acsDF['LEP%'] = acsDF['LEP#']/acsDF['Over5#']
acsDF['noCarHH%'] = acsDF['NoCarHH#']/acsDF['Households#']
acsDF['Under18%'] = acsDF['Under18#']/acsDF['TotalPopulation']
acsDF['Over65%'] = acsDF['Over65#']/acsDF['TotalPopulation']
acsDF['pov%'] = acsDF['POV#']/acsDF['TotalPopulation']
#bring in shapefile for area for population density
tracts= gpd.read_file(r'C:\Users\AtkinsonM\OneDrive - Commonwealth of Massachusetts\Documents\Jupyter_Home\tl_2012_25_tract')
#get area in square miles (ALAND from the shapefile is in square meters)
tracts['Area'] = tracts['ALAND']/2589988
tracts=tracts[tracts['COUNTYFP'].isin(['017','021','025','009','023','027'])]
#make a join field for bringing in the area from shapefile
acsDF['tract'] = acsDF.muni.astype(str).str.slice(-6, None)
#join area to attribute table
allvarDF = pd.merge(acsDF,tracts[['TRACTCE','Area']],left_on = 'tract', right_on='TRACTCE', how='left')
#create population density field (in people per square mile)
allvarDF['PopDen']=allvarDF['TotalPopulation']/allvarDF['Area']
#update data to reflect more accurate depictions of suppressed data and null data
allvarDF.loc[allvarDF['MedianIncome']<0, 'MedianIncome']=0
allvarDF = allvarDF.fillna(0)
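The Census API publishes suppressed estimates as large negative sentinel values (for example, -666666666 for a suppressed median), which is what the two lines above clean up. A minimal self-contained sketch of the same cleanup:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"MedianIncome": [87976, -666666666, np.nan]})

# Treat suppressed estimates (negative sentinels) as 0...
df.loc[df["MedianIncome"] < 0, "MedianIncome"] = 0
# ...and fill remaining nulls with 0 as well
df = df.fillna(0)
```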
Make graphs of all the variables to see what the range looks like
#create county field for color symbology based on county in census geo field (muni)
allvarDF['county'] = allvarDF.muni.astype(str).str.split(pat=',', n=2, expand=False).map(lambda x:x[1])
#make scatter plots for all the variables with each other
varscat = px.scatter_matrix(
allvarDF,
dimensions=['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%',
'pov%'],
color="county",
width=2000, height=1400
)
varscat.update_traces(diagonal_visible=False)
varscat.update_yaxes(automargin=True)
varscat.show()
To save scrolling: the silhouette analysis below scores KMeans fits across a range of cluster counts. The same analysis is also used later to test the fit of the DBSCAN clusters to the data.
#note: this first silhouette pass uses only three of the nine variables
X=allvarDF[['Minority%','LEP%','pov%','tract']].set_index('tract')
range_n_clusters = [2, 3, 4, 5, 6,7,8,9,10]
for n_clusters in range_n_clusters:
# Create a subplot with 1 row and 2 columns
fig, (ax1, ax2) = pyplot.subplots(1, 2)
fig.set_size_inches(18, 7)
# The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = metrics.silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = metrics.silhouette_samples(X, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X.iloc[:, 0], X.iloc[:, 2], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k') #this plots only two of the variables (Minority% vs pov% here)
    #so the cluster shapes will not be accurate on any of the scatter plots below
    #this is why I use the parallel coordinates chart
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
c="white", alpha=1, s=200, edgecolor='k')
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
s=50, edgecolor='k')
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
pyplot.suptitle(("Silhouette analysis for KMeans clustering on sample data "
"with n_clusters = %d" % n_clusters),
fontsize=14, fontweight='bold')
pyplot.show()
For n_clusters = 2 The average silhouette_score is : 0.6009070729116017
For n_clusters = 3 The average silhouette_score is : 0.4890269322699995
For n_clusters = 4 The average silhouette_score is : 0.3882310873223278
For n_clusters = 5 The average silhouette_score is : 0.37401156660955653
For n_clusters = 6 The average silhouette_score is : 0.3299345213013047
For n_clusters = 7 The average silhouette_score is : 0.34331472606727814
For n_clusters = 8 The average silhouette_score is : 0.34387206799489645
For n_clusters = 9 The average silhouette_score is : 0.3399574171146125
For n_clusters = 10 The average silhouette_score is : 0.3250261147224297
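Stripped of the plotting, the selection logic in the loop above reduces to scoring each candidate k and keeping the best; a sketch on synthetic blob data (illustrative values, not the tract variables):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [0, 8]],
                       cluster_std=0.5, random_state=10)

# Score each candidate cluster count by its average silhouette
scores = {}
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, random_state=10,
                    n_init=10).fit_predict(X_demo)
    scores[n_clusters] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette
```

On well-separated synthetic blobs the silhouette peaks at the true cluster count; in the tract data the steady decline after k=2 is one sign there is no strong multi-cluster structure.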
https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
#make elbow diagram
X=allvarDF[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%',
'pov%','tract']].set_index('tract')
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)
for k in K:
# Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)
distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
'euclidean'), axis=1)) / X.shape[0])
inertias.append(kmeanModel.inertia_)
mapping1[k] = sum(np.min(cdist(X, kmeanModel.cluster_centers_,
'euclidean'), axis=1)) / X.shape[0]
mapping2[k] = kmeanModel.inertia_
#distortion elbow diagram
matplotlib.pyplot.plot(K, distortions, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Distortion')
matplotlib.pyplot.title('The Elbow Method using Distortion')
matplotlib.pyplot.show()
#This shows an elbow at 3 clusters
#inertia elbow diagram
matplotlib.pyplot.plot(K, inertias, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Inertia')
matplotlib.pyplot.title('The Elbow Method using Inertia')
matplotlib.pyplot.show()
#This shows an elbow at 3 clusters
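For reference, the two curves differ only in the statistic: inertia is the within-cluster sum of squared distances (what KMeans minimizes), while the distortion used here is the mean distance from each point to its nearest center. A condensed version of the loop on synthetic data:

```python
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

distortions, inertias = [], []
for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X_demo)
    # distortion: mean distance from each point to its nearest center
    distortions.append(cdist(X_demo, km.cluster_centers_, "euclidean").min(axis=1).mean())
    # inertia: within-cluster sum of squared distances
    inertias.append(km.inertia_)
```

Both fall as k grows; the elbow is wherever the marginal improvement drops off, which is why the two diagrams usually agree.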
#testing K-means
kmeans = KMeans(n_clusters=3,init='k-means++', random_state=0).fit(X)
#put clusters into dataset
X['KmeansClusters'] = pd.Series(kmeans.labels_, index=X.index)
X
| tract | Minority% | LEP% | MedianIncome | noCarHH% | PopDen | Under18% | Over65% | Disability% | pov% | KmeansClusters |
|---|---|---|---|---|---|---|---|---|---|---|
| 744101 | 0.12 | 0.09 | 87976 | 0.04 | 1234.64 | 0.22 | 0.12 | 0.08 | 0.15 | 1 |
| 020303 | 0.46 | 0.07 | 85119 | 0.48 | 20270.07 | 0.12 | 0.06 | 0.10 | 0.18 | 1 |
| 040401 | 0.20 | 0.12 | 57569 | 0.41 | 14994.63 | 0.18 | 0.11 | 0.19 | 0.36 | 0 |
| 040801 | 0.55 | 0.22 | 54159 | 0.31 | 11339.68 | 0.18 | 0.09 | 0.09 | 0.45 | 0 |
| 050101 | 0.75 | 0.44 | 42763 | 0.42 | 48262.42 | 0.23 | 0.08 | 0.15 | 0.51 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 400100 | 0.26 | 0.09 | 62303 | 0.36 | 14592.37 | 0.10 | 0.15 | 0.10 | 0.25 | 0 |
| 400700 | 0.24 | 0.06 | 137688 | 0.16 | 14490.47 | 0.15 | 0.11 | 0.04 | 0.09 | 2 |
| 402500 | 0.10 | 0.02 | 109695 | 0.17 | 1095.50 | 0.15 | 0.21 | 0.19 | 0.05 | 1 |
| 410400 | 0.11 | 0.03 | 85523 | 0.02 | 1668.24 | 0.24 | 0.10 | 0.10 | 0.11 | 1 |
| 415200 | 0.20 | 0.02 | 81810 | 0.10 | 909.90 | 0.19 | 0.16 | 0.11 | 0.12 | 1 |

690 rows × 10 columns
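One quick way to read the clusters is to compare mean profiles by cluster label; a sketch with a toy frame standing in for `X`:

```python
import pandas as pd

# Toy stand-in for the clustered tract data
toy = pd.DataFrame({
    "Minority%": [0.12, 0.46, 0.20, 0.55, 0.75, 0.10],
    "MedianIncome": [87976, 85119, 57569, 54159, 42763, 109695],
    "KmeansClusters": [1, 1, 0, 0, 0, 2],
})

# Mean value of each variable within each cluster
profile = toy.groupby("KmeansClusters").mean()
```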
#Show Kmeans Clusters on a map
Kmap = tracts.merge(X.reset_index(), how = 'right', left_on = 'TRACTCE', right_on = 'tract')
#Kmap.plot(column='KmeansClusters')
fig = px.choropleth(Kmap,
geojson=Kmap.geometry,
locations=Kmap.index,
color="KmeansClusters",
projection="mercator")
fig.update_geos(fitbounds="locations", visible=False)
fig.show()
#Visualize Clusters with Parallel Coordinates Plot
fig = px.parallel_coordinates(X, color="KmeansClusters",
dimensions=['Minority%','LEP%','MedianIncome',
'noCarHH%','PopDen','Under18%','Over65%','Disability%','pov%'] ,
labels={'Minority%':'Minority %', 'LEP%':'LEP %',
'pov%':'Poverty %', },
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
Essentially, we do get three clusters, and they are interesting.
The parallel coordinates plot above visualizes the clusters based on the data for each tract. As you can see, pink (1) and yellow (2) are fairly similar, but blue (0) looks like a combination of outliers and data following a different pattern. DBSCAN naturally declines to assign outliers to clusters; see below.
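DBSCAN's treatment of outliers can be seen on a toy example: points with too few neighbors within `eps` receive the label -1 rather than a cluster assignment (synthetic points, not the tract data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one far-away outlier
pts = np.array([[0, 0], [0.1, 0], [0, 0.1],
                [5, 5], [5.1, 5], [5, 5.1],
                [50, 50]])

# min_samples counts the point itself, so each tight group is dense enough
labels = DBSCAN(eps=0.5, min_samples=3).fit(pts).labels_
```

The isolated point never reaches `min_samples` neighbors within `eps`, so it is labeled noise (-1) instead of being forced into a cluster, which is exactly how KMeans and DBSCAN differ on outliers.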
#https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd
#Set Parameters
Y = allvarDF[['Minority%','LEP%','MedianIncome','noCarHH%',
'PopDen','Under18%','Over65%','Disability%','pov%', 'tract']].set_index('tract')
#Minimum Samples
#assume min_samples should fall between 9 and 18 (the number of variables, and double that)
#Eps
#use nearest neighbors to estimate
neighbors = NearestNeighbors(n_neighbors=9) #use min neighbors here
neighbors_fit = neighbors.fit(Y) #actually calculate
distances, indices = neighbors_fit.kneighbors(Y) #grab results
#prepare and map
distances = np.sort(distances, axis=0)
distances = distances[:,1]
eps_plot = px.line(distances, title='average distance between each point and its n_neighbors')
eps_plot.show()
#Graph shows where diminishing returns is at the elbow - in this case 4934
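One caveat on the k-distance curve: the usual convention, with min_samples = k, sorts each point's distance to its kth neighbor (the last column of the `kneighbors` output), whereas the cell above sorts the second column. A sketch of the kth-neighbor version on synthetic data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 2))

k = 9  # match the intended min_samples
distances, _ = NearestNeighbors(n_neighbors=k).fit(pts).kneighbors(pts)
# distance from each point to its kth nearest neighbor, sorted ascending;
# the elbow of this curve is the conventional eps estimate
k_dist = np.sort(distances[:, -1])
```

Either column produces an elbow-shaped curve, but the kth-neighbor version is the one tied directly to the chosen min_samples.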
#try DBSCAN (https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py)
# Compute DBSCAN
db = DBSCAN(eps=4339, min_samples=18).fit(Y)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
#This is a Clustering Performance Evaluation - especially because no ground truth
#see: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
print("Silhouette Score: %0.3f"
% metrics.silhouette_score(Y, labels))
#Note: the higher the Silhouette Coefficient (range -1 to +1), the better defined the clusters.
#Here we do not have very well defined clusters.
Estimated number of clusters: 2
Estimated number of noise points: 293
Silhouette Score: 0.132
The quality of the clusters is low, as the silhouette score shows. Still, it is worth visualizing the results.
As you can see below, there is significant overlap between the two clusters and the noise. While many more dimensions could be explored, a silhouette score this close to zero means the clusters likely overlap to some extent (in nine dimensions), which defeats the purpose of having multiple clusters in the first place.
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [pyplot.cm.Spectral(each)
for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_member_mask = (labels == k)
#plot the core cluster members
xy = Y[class_member_mask & core_samples_mask]
pyplot.plot(xy.iloc[:, 0], xy.iloc[:, 1], 'o', markerfacecolor=tuple(col),
markeredgecolor='k', markersize=14)
#plot the border cluster members and the noise
xy = Y[class_member_mask & ~core_samples_mask]
pyplot.plot(xy.iloc[:, 0], xy.iloc[:, 1], 'o', markerfacecolor=tuple(col),
markeredgecolor='k', markersize=6)
pyplot.title('Estimated number of clusters: %d' % n_clusters_)
pyplot.show()
As before, we turn to dimension reduction, both for visualization and for possible functional improvements from looking at the relationships among the variables instead of the variable values themselves.
https://drive.google.com/file/d/1YdA-HHYP1V05QgvwLCvfnuuau67Zl38n/view
#Standardize
#"Mathematically, this can be done by subtracting the mean and dividing
#by the standard deviation for each value of each variable."
#"the reason why it is critical to perform standardization prior to PCA,
#is that the latter is quite sensitive regarding the variances of the initial variables."
# from https://builtin.com/data-science/step-step-explanation-principal-component-analysis
#create empty df and add all the columns but standardized.
std = allvarDF[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%',
'Disability%','pov%', 'county','tract']].set_index('tract')
#z-score each variable in place (subtract the mean, divide by the standard deviation)
for x in ['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%','pov%']:
    std[x] = (std[x] - std[x].mean())/std[x].std()
std
| tract | Minority% | LEP% | MedianIncome | noCarHH% | PopDen | Under18% | Over65% | Disability% | pov% | county |
|---|---|---|---|---|---|---|---|---|---|---|
| 744101 | -0.73 | -0.19 | 0.25 | -0.77 | -0.76 | 0.57 | -0.04 | -0.50 | -0.44 | Worcester County |
| 020303 | 0.61 | -0.34 | 0.17 | 1.83 | 0.62 | -0.78 | -0.96 | -0.03 | -0.21 | Suffolk County |
| 040401 | -0.41 | 0.05 | -0.59 | 1.44 | 0.24 | 0.03 | -0.12 | 1.63 | 0.91 | Suffolk County |
| 040801 | 0.98 | 1.00 | -0.68 | 0.81 | -0.03 | 0.01 | -0.44 | -0.15 | 1.47 | Suffolk County |
| 050101 | 1.77 | 2.93 | -0.99 | 1.49 | 2.64 | 0.79 | -0.58 | 0.90 | 1.87 | Suffolk County |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 400100 | -0.18 | -0.16 | -0.46 | 1.15 | 0.21 | -1.17 | 0.41 | 0.03 | 0.18 | Norfolk County |
| 400700 | -0.25 | -0.42 | 1.61 | -0.06 | 0.20 | -0.45 | -0.23 | -1.24 | -0.79 | Norfolk County |
| 402500 | -0.82 | -0.81 | 0.84 | 0.01 | -0.77 | -0.48 | 1.44 | 1.59 | -1.03 | Norfolk County |
| 410400 | -0.79 | -0.71 | 0.18 | -0.93 | -0.73 | 0.94 | -0.27 | -0.07 | -0.65 | Norfolk County |
| 415200 | -0.41 | -0.80 | 0.08 | -0.42 | -0.78 | 0.13 | 0.64 | 0.24 | -0.61 | Norfolk County |

690 rows × 10 columns
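The z-scoring above can be verified directly: after the transform each variable should have mean ≈ 0 and (sample) standard deviation ≈ 1. A self-contained sketch of the same transform:

```python
import pandas as pd

# Toy income column standing in for one of the nine variables
df = pd.DataFrame({"MedianIncome": [42763, 54159, 85119, 87976, 137688]})

# z-score: subtract the mean, divide by the (sample) standard deviation
for col in df.columns:
    df[col] = (df[col] - df[col].mean()) / df[col].std()
```

Standardizing first matters because PCA is driven by variance; without it, MedianIncome and PopDen (in the thousands) would dominate the percentage variables.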
#Show how much of the total variance the principal components explain as they are added together
#run PCA and plot the cumulative explained variance
pca = PCA(whiten=True) #all variables/components kept
#components are the eigenvalues/vectors
components = pca.fit_transform(std[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%',
'Over65%','Disability%','pov%']])
pyplot.plot(pca.explained_variance_ratio_.cumsum(), marker = 'o')
pyplot.title('Explained Variance by Components')
pyplot.xlabel('Number of Components')
pyplot.ylabel('Cumulative Explained Variance')
#Just a note - PC1 is index 0, PC2 is index 1, etc. - looks like we should use 3 PCs according to
# https://365datascience.com/tutorials/python-tutorials/pca-k-means/
# they say the rule of thumb is to preserve about 80% of variance; 2 PCs is a little below 80% and 3 is above.
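The ~80% rule of thumb can also be applied programmatically instead of read off the plot; a sketch on synthetic data built from three underlying factors (illustration only, not the tract variables):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 9 correlated variables generated from 3 underlying factors plus small noise
factors = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 9))
data = factors @ loadings + 0.1 * rng.normal(size=(500, 9))

pca = PCA().fit(data)
cum = np.cumsum(pca.explained_variance_ratio_)
# smallest number of components that preserves at least 80% of the variance
n_keep = int(np.argmax(cum >= 0.8)) + 1
```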
#actually run PCA again and show results
pca = PCA(whiten=True) #all variables/components kept
#components are the eigenvalues/vectors
components = pca.fit_transform(std[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%',
'Over65%','Disability%','pov%']])
#for each principal component, what is the percent of variance it explains
labels = {
str(i): f"PC {i+1} ({var:.1f}%)"
for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}
#display!
#range(4) shows the first four PCs; beyond this point the Principal Components have diminishing returns
pcaPlot = px.scatter_matrix(
components,
labels=labels,
dimensions=range(4),
color=std["county"]
)
pcaPlot.update_traces(diagonal_visible=False)
pcaPlot.show()
Above you can see the results of the PCA. Since PC1-PC3 together explain 79% of the variation, we will use just the first three PCs. Going closer to 90% would retain more of the data than intended: the goal is not to replace the data with all of its relationships, but to choose a few substantial relationships and reduce the dimensions.
#make new df for use in reruns of algorithms
postPCA = pd.concat([allvarDF, pd.DataFrame(components)], axis=1)
postPCA.columns.values[-9:] = ['PC1', 'PC2','PC3','PC4', 'PC5','PC6','PC7', 'PC8','PC9']
#########################make elbow diagram#########################
K_PCA = range(1, 10)
distortions_PCA = []
inertias_PCA = []
mapping1_PCA = {}
mapping2_PCA = {}
pp = postPCA[['PC1', 'PC2', 'PC3', 'tract']].set_index('tract')
for k in K_PCA:
    # Build and fit the model
    kmeanModel_PCA = KMeans(n_clusters=k).fit(pp)
    # Distortion: mean distance from each point to its nearest cluster center
    distortions_PCA.append(sum(np.min(cdist(pp, kmeanModel_PCA.cluster_centers_,
                                            'euclidean'), axis=1)) / pp.shape[0])
    inertias_PCA.append(kmeanModel_PCA.inertia_)
    mapping1_PCA[k] = distortions_PCA[-1]
    mapping2_PCA[k] = inertias_PCA[-1]
#distortion elbow diagram
matplotlib.pyplot.plot(K_PCA, distortions_PCA, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Distortion')
matplotlib.pyplot.title('The Elbow Method using Distortion')
matplotlib.pyplot.show()
#This doesn't show much of an elbow - use 3? 4?
#inertia elbow diagram
matplotlib.pyplot.plot(K_PCA, inertias_PCA, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Inertia')
matplotlib.pyplot.title('The Elbow Method using Inertia')
matplotlib.pyplot.show()
#This shows an elbow at 3/4? clusters
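When the elbow is ambiguous, one rough heuristic is to look at the second differences of the inertia curve: the k where the rate of decrease slows most sharply. A sketch of that heuristic, run on synthetic blobs standing in for the PCA scores (crude, and meant to supplement rather than replace visual inspection):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 3-PC tract scores
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.0, random_state=0)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# Second difference of the inertia curve; a large value marks a sharp bend.
# second_diff[j] is centered on k = j + 2, hence the offset below.
second_diff = np.diff(inertias, 2)
elbow_k = int(np.argmax(second_diff)) + 2
print(elbow_k)
```

On data with a smooth gradient rather than true clusters, like the tract PCs here, this estimate tends to be unstable, which is itself informative.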
X = postPCA[['PC1', 'PC2', 'PC3', 'tract']].set_index('tract')
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = pyplot.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1 to 1
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 inserts blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
    # Initialize the clusterer with the n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    # The silhouette_score gives the average value for all the samples,
    # a perspective on the density and separation of the formed clusters
    silhouette_avg = metrics.silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    # Compute the silhouette scores for each sample
    sample_silhouette_values = metrics.silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for the average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X.iloc[:, 0], X.iloc[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')  # this only plots two of the three PCs (PC1 vs PC2),
    # so the shapes of the scatter plots below will not be fully accurate -
    # this is why I use the parallel coordinates chart
    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at the cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    pyplot.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                     "with n_clusters = %d" % n_clusters),
                    fontsize=14, fontweight='bold')
    pyplot.show()
For n_clusters = 2 The average silhouette_score is : 0.34411692954204953
For n_clusters = 3 The average silhouette_score is : 0.40651226662478807
For n_clusters = 4 The average silhouette_score is : 0.3605069811364715
For n_clusters = 5 The average silhouette_score is : 0.3362338517306651
For n_clusters = 6 The average silhouette_score is : 0.33254984317799974
For n_clusters = 7 The average silhouette_score is : 0.31749164987133066
For n_clusters = 8 The average silhouette_score is : 0.30168307046158843
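The scores printed above peak at k = 3, and that selection can be automated by taking the argmax of the average silhouette score over candidate k values. A minimal sketch, again on synthetic blobs rather than the tract PCs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 3-PC tract scores
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.2, random_state=10)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Note that even the best silhouette score here (about 0.41) is modest; well-separated clusters typically score much higher, which is consistent with the gradient pattern described below.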
#testing K-means
kmeans_PCA = KMeans(n_clusters=3,init='k-means++', random_state=0).fit(pp)
#put clusters into dataset
postPCA['KmeansClusters'] = pd.Series(kmeans_PCA.labels_, index=postPCA.index)
postPCA
| | muni | Minority# | LEP# | MedianIncome | POV# | NoCarHH# | Under18# | Over65# | Disability# | Disability% | ... | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | KmeansClusters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Census Tract 7441.01, Worcester County, Massac... | 902 | 612 | 87976 | 1089 | 124 | 1600 | 873 | 556 | 0.08 | ... | -0.66 | -0.33 | -0.40 | -0.37 | 0.30 | -0.67 | -4.88e-01 | -0.49 | -0.08 | 0 |
| 1 | Census Tract 203.03, Suffolk County, Massachus... | 1600 | 239 | 85119 | 641 | 761 | 435 | 210 | 295 | 0.10 | ... | 0.53 | 1.20 | -0.17 | 0.21 | -0.98 | 1.13 | 1.90e+00 | -1.15 | -1.99 | 2 |
| 2 | Census Tract 404.01, Suffolk County, Massachus... | 512 | 267 | 57569 | 908 | 489 | 453 | 285 | 472 | 0.19 | ... | 0.74 | -0.19 | 0.95 | -0.07 | -1.74 | -0.11 | -7.96e-01 | -2.46 | -1.46 | 1 |
| 3 | Census Tract 408.01, Suffolk County, Massachus... | 2305 | 854 | 54159 | 1878 | 654 | 748 | 388 | 387 | 0.09 | ... | 0.98 | -0.23 | -0.42 | -0.10 | 0.73 | -0.49 | 8.03e-01 | -1.51 | 1.28 | 1 |
| 4 | Census Tract 501.01, Suffolk County, Massachus... | 4193 | 2222 | 42763 | 2872 | 745 | 1300 | 473 | 838 | 0.15 | ... | 2.28 | -0.33 | -0.89 | 1.70 | 0.35 | 0.09 | -2.45e+00 | 1.09 | -0.60 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 685 | Census Tract 4001, Norfolk County, Massachuset... | 1280 | 438 | 62303 | 1199 | 848 | 477 | 718 | 503 | 0.10 | ... | 0.32 | 0.86 | 0.99 | 0.09 | 0.20 | -0.24 | 7.58e-01 | -1.15 | -0.34 | 2 |
| 686 | Census Tract 4007, Norfolk County, Massachuset... | 828 | 199 | 137688 | 312 | 212 | 501 | 363 | 122 | 0.04 | ... | -0.68 | 0.97 | -0.72 | 0.85 | 0.29 | 1.37 | 5.83e-01 | -0.37 | 0.55 | 0 |
| 687 | Census Tract 4025, Norfolk County, Massachuset... | 461 | 86 | 109695 | 249 | 290 | 665 | 975 | 710 | 0.19 | ... | -0.72 | -0.44 | 1.87 | 0.45 | -0.81 | 1.66 | -8.46e-03 | -0.50 | -1.78 | 0 |
| 688 | Census Tract 4104, Norfolk County, Massachuset... | 721 | 188 | 85523 | 763 | 43 | 1615 | 693 | 654 | 0.10 | ... | -0.77 | -0.51 | -0.41 | -0.66 | -0.75 | -0.70 | -5.16e-01 | 0.10 | -0.67 | 0 |
| 689 | Census Tract 4152, Norfolk County, Massachuset... | 1143 | 108 | 81810 | 673 | 240 | 1050 | 908 | 617 | 0.11 | ... | -0.61 | -0.35 | 0.61 | -0.24 | -0.30 | -0.30 | 6.37e-01 | 0.21 | -0.48 | 0 |
690 rows × 34 columns
Here we used a k value of 3: while the elbow diagrams were inconclusive, the silhouette score clearly peaked at 3, which still corresponds to the inertia elbow diagram. Below are the results plotted in 3D. The clusters all appear attached to one another, which raises the question of whether there should be more than one cluster at all.
#3D scatter plot showing the K-means clusters in the space of the three used PCs
fig_3D = px.scatter_3d(postPCA, x='PC1', y='PC2', z='PC3', color='KmeansClusters')
fig_3D.show()
#Show Kmeans Clusters on a map
Kmap_PCA = tracts.merge(postPCA.reset_index(), how = 'right', left_on = 'TRACTCE', right_on = 'tract')
#Kmap.plot(column='KmeansClusters')
fig_PCA = px.choropleth(Kmap_PCA,
geojson=Kmap_PCA.geometry,
locations=Kmap_PCA.index,
color="KmeansClusters",
projection="mercator")
fig_PCA.update_geos(fitbounds="locations", visible=False)
fig_PCA.show()
#Visualize Clusters with Parallel Coordinates Plot
fig_PCA2 = px.parallel_coordinates(postPCA, color="KmeansClusters",
dimensions=['PC1', 'PC2', 'PC3'] ,
labels={'PC1':'Principal Component 1','PC2':'Principal Component 2',
'PC3':'Principal Component 3', },
color_continuous_scale=px.colors.sequential.Plasma)
fig_PCA2.show()
The Parallel Coordinates Plot does show how the clusters produced by K-means relate to their own and other features (tracts). However, the pattern each cluster is centered on overlaps with another cluster across a large number of features, which again suggests a gradient rather than distinct clusters. To confirm this, and the theory that there is still only one major cluster plus noise, we run DBSCAN again.
#https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd
#Set Parameters
ypca = postPCA[['PC1', 'PC2','PC3', 'tract']].set_index('tract')
#Minimum Samples
#presume that the minimum number of samples should be between 3 and 6 (the number of PCs used, and double that)
#Eps
#use nearest neighbors to estimate
neighbors_pca = NearestNeighbors(n_neighbors=6) #after running DBSCAN a few times, it looks like 6 is optimal
neighbors_fit_pca = neighbors_pca.fit(ypca) #actually calculate
distances_pca, indices = neighbors_fit_pca.kneighbors(ypca) #grab results
#prepare and plot
distances_pca = np.sort(distances_pca, axis=0)
distances_pca = distances_pca[:,1] #sorted distance from each point to its nearest neighbor
eps_plot_pca = px.line(distances_pca, title='sorted nearest-neighbor distance for each point')
eps_plot_pca.show()
#The elbow of the graph, where diminishing returns set in, suggests an eps of around 0.72? 0.86?
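The bend in the sorted k-distance curve can also be located numerically instead of by eye, for instance via the largest second difference of the curve. This is a rough sketch on a generic synthetic point set (the eps of 0.69 used below was still hand-tuned against the actual data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Generic synthetic point set standing in for the tract PC scores
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.8, random_state=0)

k = 6  # match DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance from each point to its k-th nearest neighbor
kth = np.sort(distances[:, -1])

# Crude elbow estimate: the point of largest second difference (max bend)
bend = int(np.argmax(np.diff(kth, 2))) + 1
eps_estimate = float(kth[bend])
print(round(eps_estimate, 3))
```

A common convention for this plot is to use the distance to the k-th neighbor (with k = min_samples) rather than the 1st, which is why the sketch reads `distances[:, -1]`.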
# #############################################################################
# Compute DBSCAN
ypca_db = DBSCAN(eps=0.69, min_samples=6).fit(ypca)
core_samples_mask_pca = np.zeros_like(ypca_db.labels_, dtype=bool)
core_samples_mask_pca[ypca_db.core_sample_indices_] = True
labels_pca = ypca_db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_pca = len(set(labels_pca)) - (1 if -1 in labels_pca else 0)
n_noise_pca = list(labels_pca).count(-1)
print('Estimated number of clusters: %d' % n_clusters_pca)
print('Estimated number of noise points: %d' % n_noise_pca)
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(ypca, labels_pca))
Estimated number of clusters: 1
Estimated number of noise points: 35
Silhouette Coefficient: 0.496
Yet again, a good fit with a bad result (i.e., only one cluster).
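One way to check that the single-cluster result is not an artifact of one particular (eps, min_samples) choice is to sweep both parameters and count the clusters found at each setting. A sketch on a synthetic "one dense blob plus outliers" point set that mimics the pattern seen here (the parameter grids are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# One dense blob plus scattered outliers, mimicking the tract pattern
core, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=0)
rng = np.random.default_rng(0)
outliers = rng.uniform(-6, 6, size=(15, 2))
X = np.vstack([core, outliers])

# Count clusters (ignoring the noise label -1) across a parameter grid
results = {}
for eps in (0.3, 0.5, 0.7, 0.9):
    for min_samples in (3, 6, 9):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        results[(eps, min_samples)] = n_clusters

print(results)
```

If the cluster count stays at one across a reasonable grid, as it appears to for the tract data, that strengthens the conclusion that the structure is a gradient rather than multiple clusters.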
#3D scatter plot showing the DBSCAN clusters in the space of the three used PCs
ypca['DBSCAN_Clusters'] = labels_pca
fig_3D_DB = px.scatter_3d(ypca, x='PC1', y='PC2', z='PC3', color='DBSCAN_Clusters')
fig_3D_DB.show()
#YELLOW IS CLUSTER, BLUE IS NOISE